{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "source": [
        "# Quantization of a Finetuned BERT Model with HF's Optimum\n",
        "This notebook is a companion of chapter 5 of the \"Domain Specific LLMs in Action\" book, author Guglielmo Iozzia, [Manning Publications](https://www.manning.com/), 2024.  \n",
        "The code in this notebook is to introduce readers to the quantization of an encoder-only language model, [distilbert-base-uncased-finetuned-banking77](https://huggingface.co/optimum/distilbert-base-uncased-finetuned-banking77) using the Hugging Face's [Optimum](https://github.com/huggingface/optimum) library. It doesn't require hardware acceleration.  \n",
        "More details about the code can be found in the related book's chapter."
      ],
      "metadata": {
        "id": "ad3y3Qr2z-5D"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "Install the missing requirements in the Colab VM (only the latest HF's Optimum for the ONNX runtime and Evaluate)."
      ],
      "metadata": {
        "id": "HaeX1dd_H3WR"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "!pip install optimum[onnxruntime] evaluate"
      ],
      "metadata": {
        "id": "FNHsrOLCqqKE"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Force the downgrade of the HF's Dataset package to release 3.6.0 because of the following [bug](https://github.com/huggingface/datasets/issues/7693) introduced in the latest version 4.0.0 (installed by default in the Colab VMs). A runtime restart would be needed when completed."
      ],
      "metadata": {
        "id": "j7dxuFQUEIOR"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "!pip install --force-reinstall datasets==3.6.0"
      ],
      "metadata": {
        "id": "enLAfFClCLqX"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Import the required classes."
      ],
      "metadata": {
        "id": "fK6DUXOs1V9l"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from pathlib import Path\n",
        "from optimum.onnxruntime import ORTModelForSequenceClassification\n",
        "from transformers import AutoTokenizer"
      ],
      "metadata": {
        "id": "V5sHCPgv1aBa"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Load the distilbert base uncased finetuned model from the HF Hub and convert it to ONNX (fp32). Then save it and the associated tokenizer to disk."
      ],
      "metadata": {
        "id": "WEPs5msnMCLw"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "model_id=\"optimum/distilbert-base-uncased-finetuned-banking77\"\n",
        "onnx_path = Path(\"onnx\")\n",
        "\n",
        "model = ORTModelForSequenceClassification.from_pretrained(model_id,\n",
        "                                                          export=True)\n",
        "tokenizer = AutoTokenizer.from_pretrained(model_id)\n",
        "\n",
        "model.save_pretrained(onnx_path)\n",
        "tokenizer.save_pretrained(onnx_path)"
      ],
      "metadata": {
        "id": "9J4p_iA2XCGY"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Check that the downloaded model works as expected. Transformers' pipelines are supported in Optimum."
      ],
      "metadata": {
        "id": "YlAYQLphXS8J"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from transformers import pipeline\n",
        "\n",
        "vanilla_clf = pipeline(\"text-classification\", model=model, tokenizer=tokenizer)\n",
        "vanilla_clf(\"Could you assist me in checking my card validity?\")"
      ],
      "metadata": {
        "id": "Y-ke7ET-XOay"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Quantize the model dynamically. First create an ORTQuantizer instance and define the quantization configuration. Then apply the quantization configuration to the model."
      ],
      "metadata": {
        "id": "wMELF2cZXsSb"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from optimum.onnxruntime import ORTQuantizer\n",
        "from optimum.onnxruntime.configuration import AutoQuantizationConfig\n",
        "\n",
        "dynamic_quantizer = ORTQuantizer.from_pretrained(model)\n",
        "dqconfig = AutoQuantizationConfig.avx512_vnni(is_static=False,\n",
        "                                              per_channel=False)\n",
        "\n",
        "model_quantized_path = dynamic_quantizer.quantize(\n",
        "    save_dir=onnx_path,\n",
        "    quantization_config=dqconfig,\n",
        ")"
      ],
      "metadata": {
        "id": "_Hw1RLL3XvzL"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Compare the size of the downloaded ONNX model and its quantized version."
      ],
      "metadata": {
        "id": "TIRVmiVv3HEb"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "import os\n",
        "\n",
        "original_model_name = \"model.onnx\"\n",
        "quantized_model_name = \"model_quantized.onnx\"\n",
        "size = os.path.getsize(onnx_path / original_model_name)/(1024*1024)\n",
        "quantized_model = os.path.getsize(onnx_path / quantized_model_name)/(1024*1024)\n",
        "\n",
        "print(f\"Original Model file size: {size:.2f} MB\")\n",
        "print(f\"Quantized Model file size: {quantized_model:.2f} MB\")"
      ],
      "metadata": {
        "id": "GxM_lX8ZX_Ct"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Check that inference with the quantized model works as expected."
      ],
      "metadata": {
        "id": "pRkKmDh4YF6P"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from optimum.onnxruntime import ORTModelForSequenceClassification\n",
        "from transformers import pipeline, AutoTokenizer\n",
        "\n",
        "model = ORTModelForSequenceClassification.from_pretrained(onnx_path,\n",
        "                                                file_name=quantized_model_name)\n",
        "tokenizer = AutoTokenizer.from_pretrained(onnx_path)\n",
        "\n",
        "q8_clf = pipeline(\"text-classification\",model=model, tokenizer=tokenizer)\n",
        "\n",
        "q8_clf(\"Could you assist me in checking my card validity?\")"
      ],
      "metadata": {
        "id": "SYcRR9_EYAS5"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Models' performance evaluation"
      ],
      "metadata": {
        "id": "M-J7HOMtYKzz"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "Download the test set of the Banking 77 dataset (available in the HF's Hub)."
      ],
      "metadata": {
        "id": "_6MUclsYAIYr"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from evaluate import evaluator\n",
        "from datasets import load_dataset\n",
        "\n",
        "dataset_id=\"PolyAI/banking77\"\n",
        "eval = evaluator(\"text-classification\")\n",
        "eval_dataset = load_dataset(dataset_id, split=\"test\")"
      ],
      "metadata": {
        "id": "hGAC5fK7AKKe"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Evaluate the quantized model towards the downloaded test set (the HF's Evalate library is used)."
      ],
      "metadata": {
        "id": "BDFiEunhEep0"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "results = eval.compute(\n",
        "    model_or_pipeline=q8_clf,\n",
        "    data=eval_dataset,\n",
        "    metric=\"accuracy\",\n",
        "    input_column=\"text\",\n",
        "    label_column=\"label\",\n",
        "    label_mapping=model.config.label2id,\n",
        "    strategy=\"simple\",\n",
        ")\n",
        "print(results)"
      ],
      "metadata": {
        "id": "QmayEVJsEcBz"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Compare the test scores across the original ONNX model and its quantized version."
      ],
      "metadata": {
        "id": "DrnN6HtK7-WI"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "print(f\"Vanilla model: 92.5%\")\n",
        "print(f\"Quantized model: {results['accuracy']*100:.2f}%\")\n",
        "print(f\"The quantized model achieves {round(results['accuracy']/0.925,4)*100:.2f}% accuracy of the fp32 model\")"
      ],
      "metadata": {
        "id": "iO54iqZxAUnv"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Define a function to benchmark the execution times for both models."
      ],
      "metadata": {
        "id": "F7JTMiF1AaQb"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from time import perf_counter\n",
        "import numpy as np\n",
        "\n",
        "def measure_latency(payload_prompt, pipe):\n",
        "    latencies = []\n",
        "    # Warm up\n",
        "    for _ in range(10):\n",
        "        _ = pipe(payload_prompt)\n",
        "    # Effective runs\n",
        "    for _ in range(300):\n",
        "        start_time = perf_counter()\n",
        "        _ =  pipe(payload_prompt)\n",
        "        latency = perf_counter() - start_time\n",
        "        latencies.append(latency)\n",
        "\n",
        "    time_avg_ms = 1000 * np.mean(latencies)\n",
        "    time_std_ms = 1000 * np.std(latencies)\n",
        "    time_p95_ms = 1000 * np.percentile(latencies,95)\n",
        "\n",
        "    return f\"P95 latency (ms) - {time_p95_ms}; Average latency (ms) - {time_avg_ms:.2f} +\\\\- {time_std_ms:.2f};\", time_p95_ms"
      ],
      "metadata": {
        "id": "Z7Iq5iP9AeIm"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Benchmark the two versions of the model and compare the results."
      ],
      "metadata": {
        "id": "DCAUR0Fz8nYJ"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "prompt=\"Dear Sir/Madam, my name is William. I am getting in touch because I didn't get a response from you yet. What actions do I need to do to get my new card which I have requested 3 weeks ago? Please help me and answer this email as soon as possible. Have a nice rest of the day. Best Regards.\"*2\n",
        "print(f'Prompt length: {len(tokenizer(prompt)[\"input_ids\"])}')\n",
        "\n",
        "original_model_stats = measure_latency(prompt, vanilla_clf)\n",
        "quantized_model_stats = measure_latency(prompt, q8_clf)\n",
        "\n",
        "print(f\"Vanilla model: {original_model_stats[0]}\")\n",
        "print(f\"Quantized model: {quantized_model_stats[0]}\")\n",
        "print(f\"Improvement through quantization: {round(original_model_stats[1]/quantized_model_stats[1],2)}x\")"
      ],
      "metadata": {
        "id": "4Cyxwn9k8ij8"
      },
      "execution_count": null,
      "outputs": []
    }
  ]
}